A platform to programmatically author, schedule and monitor workflows


setup

# install to default $AIRFLOW_HOME folder ~/airflow
pip install apache-airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

# start the scheduler
airflow scheduler

# visit localhost:8080 in the browser and enable the example dag in the home page
# check the installed version (from a Python shell)
import airflow
airflow.__version__
#  '1.10.12'

basics

terms

  • task/operator - a defined unit of work
  • task instance - an individual run of a single task
  • dag - directed acyclic graph - a set of tasks with explicit execution order
  • dag run - individual execution/run of a DAG

components

  • web server - a GUI where you can track job status and read task logs
  • scheduler - responsible for triggering DAG runs and scheduling tasks
  • executor - the mechanism that actually runs the tasks
  • metadata database - stores Airflow state and powers how the other components interact

operators

PythonOperator

from pprint import pprint

from airflow.operators.python_operator import PythonOperator

def print_context(ds, **kwargs):
    # ds is the execution date as a string; kwargs holds the rest of the context
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'

run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,  # pass the context dict as kwargs (Airflow 1.x only)
    python_callable=print_context,
    dag=dag,
)

BashOperator

from airflow.operators.bash_operator import BashOperator

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

EmailOperator
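A sketch of sending a notification email (assumes SMTP is configured in airflow.cfg; the recipient and content are placeholders):

```python
from airflow.operators.email_operator import EmailOperator

send_email = EmailOperator(
    task_id='send_email',
    to='user@example.com',  # placeholder recipient
    subject='Airflow alert',
    html_content='<p>The DAG run finished.</p>',
    dag=dag,
)
```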

SimpleHttpOperator
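A sketch of calling an HTTP endpoint (assumes an `http_default` connection has been defined in the Airflow UI or CLI; the endpoint name is a placeholder):

```python
from airflow.operators.http_operator import SimpleHttpOperator

# issues GET <http_default base URL>/health
check_api = SimpleHttpOperator(
    task_id='check_api',
    http_conn_id='http_default',
    endpoint='health',
    method='GET',
    dag=dag,
)
```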

MySqlOperator
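A sketch of running SQL against MySQL (assumes a `mysql_default` connection is configured in Airflow; the table and SQL are placeholders):

```python
from airflow.operators.mysql_operator import MySqlOperator

create_table = MySqlOperator(
    task_id='create_table',
    mysql_conn_id='mysql_default',
    sql='CREATE TABLE IF NOT EXISTS demo (id INT)',
    dag=dag,
)
```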